New quant strategy / FTYPE IQ3_XL 4bpw #9855
An intermediary FTYPE mixed between IQ3_M and IQ4_XS, targeting 4bpw.
Ported faithfully from @ikawrakow's new FTYPE IQ3_KL on ik_llama.cpp, so it can be trusted, except for attn_k.weight, which I chose to bump to IQ4_XS when GQA is present: in that case it makes no sense whatsoever for the key heads to be quantized smaller than attn_output.weight or half of the FFN tensors.
The XL suffix is chosen in anticipation of a possible future "IQ3_M/L" or "IQ4_XXS" GGML_TYPE close to 4bpw, with a related FTYPE. In that case, the proposed mixed FTYPE could be replaced.
This FTYPE answers the demand of many users, myself included, for an intermediary between IQ3_M and IQ4_XS: the 0.5bpw gap between them prevents users from getting the best fully offloadable quality in many cases (e.g. a 70b model on 36GiB of VRAM, or a 123b model on 64GiB).
In a later PR that I can provide, and in order not to multiply the FTYPEs, IQ2_M could be eliminated and IQ2_S promoted to replace it, which would make sense in terms of nomenclature (currently, IQ2_S is an IQ2_XS+, while IQ2_M actually uses the IQ2_S GGML_TYPE). IQ2_XS would then get a small boost to compensate for the disappearance of the old IQ2_S FTYPE, which is, to be honest, first in line to be sacrificed. Among other FTYPE elimination candidates, either IQ3_S or IQ3_M could also be retired.
Note: I allow myself to open this PR because I have experimented quite extensively with the quant strategies, and have already revamped llama.cpp's quant strategies for my own use, some of which has already been PRed on this repo as a demo.
In IK's graph, IQ3_XL's dot should look like IQ3_KL's, with slightly more weight and slightly more perplexity. It's clearly viable, although a bit different from my own custom quant strategies (I use different use_more_bits formulas myself, and bump attn_v and attn_k further, to Q6_K and Q5_K respectively).
Ref : ikawrakow/ik_llama.cpp@b30c9e1